[TRTLLM-10029][scheduler] Re-implement MicroBatchScheduler and CapacityScheduler in Python#10273
Conversation
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #29796 [ run ] triggered by Bot. Commit:
📝 Walkthrough

These changes extend Python bindings for the GenLlmReq and KVCacheManager C++ classes, add an environment variable to enable Python-based scheduling, and introduce a comprehensive Python scheduling framework with capacity and micro-batch scheduling policies as an alternative to the C++ scheduler components.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Executor as Executor
    participant Sched as SimpleUnifiedScheduler
    participant Capacity as PyCapacityScheduler
    participant MicroBatch as PyMicroBatchScheduler
    participant KVCache as KVCacheManager
    participant Policy as SchedulerPolicy
    Executor->>Sched: schedule(pending_requests, running_requests, kv_cache_manager)
    activate Sched
    Sched->>Capacity: schedule(pending, running, kv_cache_manager)
    activate Capacity
    Capacity->>Policy: get_new_request_ids(pending)
    activate Policy
    Policy->>Capacity: filtered_request_ids
    deactivate Policy
    loop For each candidate request
        Capacity->>KVCache: find_new_context_block(unique_tokens, request)
        KVCache->>Capacity: context_block_info
        Capacity->>Capacity: fit_request_to_blocks()
    end
    Capacity->>Sched: scheduled_requests, paused_requests
    deactivate Capacity
    Sched->>MicroBatch: schedule(scheduled_requests, kv_cache_manager)
    activate MicroBatch
    MicroBatch->>MicroBatch: compute_chunk_sizes()
    rect rgb(200, 220, 255)
        note right of MicroBatch: Encoder phase
        MicroBatch->>KVCache: scheduling_has_free_blocks()
        KVCache->>MicroBatch: has_free
    end
    rect rgb(220, 240, 220)
        note right of MicroBatch: Context phase
        MicroBatch->>MicroBatch: select_requests_for_context()
    end
    rect rgb(255, 240, 200)
        note right of MicroBatch: Generation phase
        MicroBatch->>MicroBatch: select_requests_for_generation()
    end
    MicroBatch->>Sched: SchedulerOutput (batches, tokens)
    deactivate MicroBatch
    Sched->>Executor: SchedulerOutput
    deactivate Sched
```
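The two-stage flow in the diagram can be sketched in plain Python. This is a hypothetical illustration, not the PR's actual API: `Request`, `BLOCK_SIZE`, and the class internals are made up; only the shape (a capacity pass that admits requests while KV-cache blocks remain, then a micro-batch pass that splits admitted requests into context and generation batches) follows the diagram.

```python
# Hypothetical sketch of the SimpleUnifiedScheduler flow shown above.
# Names and block accounting are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative)

@dataclass
class Request:
    request_id: int
    prompt_len: int
    is_context: bool = True  # context (prefill) vs. generation (decode)

class KVCacheManager:
    def __init__(self, free_blocks: int):
        self.free_blocks = free_blocks

    def blocks_needed(self, req: Request) -> int:
        return -(-req.prompt_len // BLOCK_SIZE)  # ceiling division

    def try_allocate(self, req: Request) -> bool:
        need = self.blocks_needed(req)
        if need <= self.free_blocks:
            self.free_blocks -= need
            return True
        return False

class CapacityScheduler:
    """Admit pending requests while KV-cache blocks remain; pause the rest."""
    def schedule(self, pending, kv) -> Tuple[List[Request], List[Request]]:
        scheduled, paused = [], []
        for req in pending:
            (scheduled if kv.try_allocate(req) else paused).append(req)
        return scheduled, paused

class MicroBatchScheduler:
    """Split admitted requests into context and generation batches."""
    def schedule(self, scheduled) -> Tuple[List[Request], List[Request]]:
        ctx = [r for r in scheduled if r.is_context]
        gen = [r for r in scheduled if not r.is_context]
        return ctx, gen

class SimpleUnifiedScheduler:
    def __init__(self, kv: KVCacheManager):
        self.kv = kv
        self.capacity = CapacityScheduler()
        self.micro_batch = MicroBatchScheduler()

    def schedule(self, pending):
        scheduled, paused = self.capacity.schedule(pending, self.kv)
        ctx, gen = self.micro_batch.schedule(scheduled)
        return ctx, gen, paused

kv = KVCacheManager(free_blocks=4)
sched = SimpleUnifiedScheduler(kv)
pending = [Request(0, 8), Request(1, 4, is_context=False), Request(2, 12)]
ctx, gen, paused = sched.schedule(pending)
print([r.request_id for r in ctx],
      [r.request_id for r in gen],
      [r.request_id for r in paused])  # → [0] [1] [2]
```

Request 2 needs three blocks but only one remains after the first two requests are admitted, so it is paused rather than scheduled, mirroring the `paused_requests` edge in the diagram.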
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
PR_Github #31953 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #31985 [ run ] triggered by Bot. Commit:
PR_Github #31985 [ run ] completed with state
Funatiq
left a comment
Great work. I haven't looked at the implementation in detail yet. Two suggestions:
- I think we should add a simple test to CI so that we don't break the functionality by accident. E.g. run `test_overlap_scheduler.py` with the Python scheduler too.
- Can we run a performance check on a smaller model like Llama-3.2-1B? The overhead should be more significant there. Ideally we should add NVTX ranges and collect nsys profiles to isolate the differences in execution time for the scheduler.
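Adding an NVTX range around the scheduler call, as suggested, could look roughly like the sketch below. This is not the PR's code: `scheduler_range` and `run_iteration` are invented names, and the fallback to a no-op context manager (for when the `nvtx` package is absent) is a convenience assumption.

```python
# Hypothetical sketch: wrap the scheduler call in an NVTX range so that
# nsys profiles can isolate scheduler execution time per iteration.
from contextlib import nullcontext

try:
    import nvtx  # optional dependency; provides nvtx.annotate()

    def scheduler_range():
        return nvtx.annotate("py_scheduler", color="blue")
except ImportError:
    def scheduler_range():
        return nullcontext()  # no-op when nvtx is unavailable

def run_iteration(schedule_fn, *args):
    # Everything inside this `with` shows up as one "py_scheduler"
    # range in the nsys timeline.
    with scheduler_range():
        return schedule_fn(*args)

# Placeholder scheduler: any callable works.
result = run_iteration(lambda reqs: sorted(reqs), [3, 1, 2])
print(result)  # → [1, 2, 3]
```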
Thanks for the review! @Funatiq
Since you already have nsys profiles, could you report what the runtime for only the
Sure, this image shows the result mentioned above (nsys reports a different schedule time for each iteration, so we only recorded the averages/medians). Details are in: https://docs.google.com/document/d/1he4S6hzDBApMGp2Bl5PTED-hcKaRmXDZbOi9EgJlK5A/edit?tab=t.0
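Reducing per-iteration schedule times to the averages/medians mentioned above is straightforward with the standard library. The timing values below are made up for illustration; in practice they would come from nsys ranges or `time.perf_counter()` measurements around each schedule call.

```python
# Illustrative sketch: summarize per-iteration scheduler timings.
# The sample values are fabricated, not the PR's measurements.
from statistics import mean, median

schedule_times_us = [118.0, 102.0, 131.0, 99.0, 125.0]

avg = mean(schedule_times_us)
med = median(schedule_times_us)
print(f"avg={avg:.1f}us median={med:.1f}us")  # → avg=115.0us median=118.0us
```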
Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #32429 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #32433 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #32442 [ run ] triggered by Bot. Commit:
PR_Github #32442 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #32456 [ run ] triggered by Bot. Commit:
@Funatiq Hi, I added an E2E benchmark on Llama-3.2-1B as suggested. The host overhead also seems acceptable on Llama-3.2-1B:
PR_Github #32456 [ run ] completed with state
Thanks for the benchmarks. Could you add a short summary to the PR description please?
Sure, I have updated the PR description. I think this PR can be merged. |



As titled. This PR is our first step toward refactoring the scheduler.
Goal:
Deliverables:
The overhead of the Python scheduler seems acceptable even in scenarios where host overhead is the bottleneck. The benchmark results are:
Details can be found in: Unified Python SPMD Scheduler Execution Plan & Performance Strategy